
Fix schema option not working #946


Merged · 14 commits · Mar 21, 2025
Conversation

@payala payala commented Mar 13, 2025

Adding a pydantic schema to SmartScraperGraph was not working because the format instructions were being appended to the prompt text, which broke the parsing of the prompt template variables.

The appended "IMPORTANT: " text is removed: the format_instructions are already added to the prompt as template variables, and the appended copy is what breaks the prompt when a schema is passed.

This is my first contribution to this project. I tried to follow all the guidelines; please let me know if there is anything I should do differently.
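The failure mode can be sketched with plain Python string formatting (the template and schema text below are hypothetical, not the project's actual prompts): format instructions contain a literal JSON example, so appending them to the template introduces `{` `}` braces that the template parser then tries to resolve as variables.

```python
# Hypothetical sketch of the bug: format instructions contain literal JSON
# braces, so appending them to the template text breaks variable parsing.
template = "Answer the user question.\n{format_instructions}\nQuestion: {question}"
format_instructions = 'Return JSON matching: {"title": "...", "price": 0.0}'

# Passing the instructions as a template *variable* is safe: substituted
# values are inserted literally and never re-parsed.
ok = template.format(format_instructions=format_instructions,
                     question="What is the price?")

# Appending them to the template itself is not: the parser now treats the
# JSON braces as template fields and fails.
broken = template + "\nIMPORTANT: " + format_instructions
try:
    broken.format(format_instructions="", question="What is the price?")
except (KeyError, ValueError) as exc:
    print("template parsing broke:", repr(exc))
```

LangChain prompt templates use the same brace syntax, which is why injecting format_instructions as a template variable (as this PR keeps) works, while appending the raw text raises a parsing error.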

VinciGit00 and others added 14 commits March 9, 2025 15:09
## [1.41.0](ScrapeGraphAI/Scrapegraph-ai@v1.40.1...v1.41.0) (2025-03-09)

### Features

* add CLoD integration ([4e0e785](ScrapeGraphAI@4e0e785))

### Test

* Add coverage improvement test for tests/test_generate_answer_node.py ([6769c0d](ScrapeGraphAI@6769c0d))
* Add coverage improvement test for tests/test_models_tokens.py ([b21e781](ScrapeGraphAI@b21e781))
* Update coverage improvement test for tests/graphs/abstract_graph_test.py ([f296ac4](ScrapeGraphAI@f296ac4))

### CI

* **release:** 1.41.0-beta.1 [skip ci] ([7bfe494](ScrapeGraphAI@7bfe494))
## [1.42.1](ScrapeGraphAI/Scrapegraph-ai@v1.42.0...v1.42.1) (2025-03-12)

### Bug Fixes

* add new gpt model ([cff799b](ScrapeGraphAI@cff799b))
## [1.43.0](ScrapeGraphAI/Scrapegraph-ai@v1.42.1...v1.43.0) (2025-03-13)

### Features

* add integration for o3min ([fc0a148](ScrapeGraphAI@fc0a148))
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. bug Something isn't working tests Improvements or additions to test labels Mar 13, 2025
Contributor

codebeaver-ai bot commented Mar 13, 2025

I opened a Pull Request with the following:

🔄 4 test files added and 7 test files updated to reflect recent changes.
🐛 Found 1 bug
🛠️ 94/133 tests passed

🔄 Test Updates

I've added or updated 8 tests. They all pass ☑️
Updated Tests:

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_json

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_xml

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_csv

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_txt

  • tests/graphs/abstract_graph_test.py 🩹

    Fixed: tests/graphs/abstract_graph_test.py::TestAbstractGraph::test_create_llm[llm_config5-ChatBedrock]

  • tests/graphs/abstract_graph_test.py 🩹

    Fixed: tests/graphs/abstract_graph_test.py::TestAbstractGraph::test_create_llm_with_rate_limit[llm_config5-ChatBedrock]

  • tests/utils/test_proxy_rotation.py 🩹

    Fixed: tests/utils/test_proxy_rotation.py::test_parse_or_search_proxy_success

New Tests:

  • tests/test_generate_answer_node.py

🐛 Bug Detection

Potential issues:

  • scrapegraphai/utils/research_web.py
    The error is occurring in the test_google_search function. The test is expecting exactly 2 results from the search_on_web function, but it's receiving 4 results instead. This mismatch is causing the assertion to fail.
    Let's break down the problem:
  1. The test is calling search_on_web("test query", search_engine="duckduckgo", max_results=2).
  2. The function is expected to return 2 results (as specified by max_results=2).
  3. However, the function is actually returning 4 results.
    This suggests that the search_on_web function is not correctly limiting the number of results to the specified max_results parameter when using the DuckDuckGo search engine.
    The issue is likely in the implementation of the DuckDuckGo search in the search_on_web function. Specifically, in this part of the code:
if search_engine == "duckduckgo":
    research = DuckDuckGoSearchResults(max_results=max_results)
    res = research.run(query)
    results = re.findall(r"https?://[^\s,\]]+", res)

The DuckDuckGoSearchResults object is created with the correct max_results, but the results are then extracted using a regex pattern. This regex extraction might not be respecting the max_results limit.
To fix this, the code should explicitly limit the number of results after the regex extraction:

results = re.findall(r"https?://[^\s,\]]+", res)[:max_results]

This change would ensure that no more than max_results URLs are returned, regardless of how many are found by the regex.
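The suggested fix can be sketched as a self-contained helper (the function name and sample text are hypothetical): slicing the regex matches enforces the `max_results` cap no matter how many URLs the raw engine output contains.

```python
import re

# Hypothetical helper illustrating the suggested fix: cap the number of
# extracted URLs after the regex pass, since the raw engine output may
# contain more links than max_results.
def extract_urls(raw: str, max_results: int) -> list[str]:
    return re.findall(r"https?://[^\s,\]]+", raw)[:max_results]

raw = ("snippet, https://a.example/1, text https://b.example/2] "
       "https://c.example/3")
print(extract_urls(raw, 2))  # → ['https://a.example/1', 'https://b.example/2']
```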

Test Error Log
tests/utils/research_web_test.py::test_google_search: def test_google_search():
        """Tests search_on_web with Google search engine."""
>       results = search_on_web("test query", search_engine="Google", max_results=2)
tests/utils/research_web_test.py:10: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
query = 'test query', search_engine = 'google', max_results = 2, port = 8080
timeout = 10, proxy = None, serper_api_key = None, region = None
language = 'en'
    def search_on_web(
        query: str,
        search_engine: str = "duckduckgo",
        max_results: int = 10,
        port: int = 8080,
        timeout: int = 10,
        proxy: str | dict = None,
        serper_api_key: str = None,
        region: str = None,
        language: str = "en",
    ) -> List[str]:
        """Search web function with improved error handling and validation
    
        Args:
            query (str): Search query
            search_engine (str): Search engine to use
            max_results (int): Maximum number of results to return
            port (int): Port for SearXNG
            timeout (int): Request timeout in seconds
            proxy (str | dict): Proxy configuration
            serper_api_key (str): API key for Serper
            region (str): Country/region code (e.g., 'mx' for Mexico)
            language (str): Language code (e.g., 'es' for Spanish)
        """
    
        # Input validation
        if not query or not isinstance(query, str):
            raise ValueError("Query must be a non-empty string")
    
        search_engine = search_engine.lower()
        valid_engines = {"duckduckgo", "bing", "searxng", "serper"}
        if search_engine not in valid_engines:
>           raise ValueError(f"Search engine must be one of: {', '.join(valid_engines)}")
E           ValueError: Search engine must be one of: searxng, duckduckgo, serper, bing
scrapegraphai/utils/research_web.py:45: ValueError
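Note that the failure in this log is distinct from the max_results issue analyzed above: the test passes search_engine="Google", which is lowercased to "google" and rejected because only duckduckgo, bing, searxng, and serper are accepted. A minimal reproduction of that validation path (hypothetical helper; `sorted()` is added here so the error message is deterministic, unlike the raw set iteration order visible in the log):

```python
# Hypothetical reproduction of the validation path shown in the log:
# "google" is not a supported engine, so the call raises before searching.
def validate_engine(search_engine: str) -> str:
    engine = search_engine.lower()
    valid_engines = {"duckduckgo", "bing", "searxng", "serper"}
    if engine not in valid_engines:
        # sorted() makes the message deterministic; the version in the log
        # joins the raw set, so its ordering can vary between runs.
        raise ValueError(
            f"Search engine must be one of: {', '.join(sorted(valid_engines))}"
        )
    return engine

try:
    validate_engine("Google")
except ValueError as exc:
    print(exc)  # Search engine must be one of: bing, duckduckgo, searxng, serper
```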

☂️ Coverage Improvements

Coverage improvements by file:

  • tests/nodes/fetch_node_test.py

    New coverage: 71.30%
    Improvement: +71.30%

  • tests/graphs/abstract_graph_test.py

    New coverage: 71.88%
    Improvement: +71.88%

  • tests/utils/test_proxy_rotation.py

    New coverage: 0.00%
    Improvement: +0.00%

  • tests/test_generate_answer_node.py

    New coverage: 85.71%
    Improvement: +8.73%

🎨 Final Touches

  • I ran the hooks included in the pre-commit config.


@VinciGit00
Collaborator

Hi @payala,

could you please add a screenshot of results?

@payala
Author

payala commented Mar 21, 2025

You mean this, @VinciGit00?
[screenshot attached]

@VinciGit00
Collaborator

Yes thx

@VinciGit00 VinciGit00 changed the base branch from main to pre/beta March 21, 2025 08:17
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 21, 2025
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 21, 2025
@VinciGit00 VinciGit00 merged commit 16de81f into ScrapeGraphAI:pre/beta Mar 21, 2025
4 checks passed

🎉 This PR is included in version 1.43.1-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀


🎉 This PR is included in version 1.43.1 🎉

The release is available on:

Your semantic-release bot 📦🚀
